ERG 2081 Final Report: Drug Mechanism of Action (MoA) Prediction Model

Introduction of Drug Mechanism of Action (MoA):

     In the past, scientists derived drugs from natural products or were inspired by traditional remedies. Drugs were often in clinical use for decades before the biological mechanisms driving their pharmacological activities were understood. With the advent of more powerful technologies, drug discovery has shifted from the serendipitous approaches of the past to a more targeted model based on an understanding of the underlying biological mechanism of a disease.

     In this new framework, scientists seek to identify the mechanism of action of new drugs, which typically specifies the molecular target to which the drug binds, such as an enzyme or receptor. Knowing how a particular site on a drug interacts with its receptor allows other drugs to be formulated to replicate that interaction and thus produce the same therapeutic effect. This method is used to create new drugs.

Motivation: How do we better predict drug-target interactions (DTIs) to determine the mechanisms of action (MoAs) of a new drug?

     One approach is to treat a sample of human cells with the drug and then analyze the cellular responses with algorithms that search for similarity to known patterns in large genomic databases, such as libraries of gene expression or cell viability patterns of drugs with known MoAs (e.g. Connectivity Map and DepMap).

     In this project, I build a model from the training set that automatically labels each case in the test set with one or more MoA classes, which is a multi-label classification problem. I evaluate the model by the average logarithmic loss over all drug-MoA annotation pairs in the test set. Details of the datasets are provided below.

Data Description:

     Several databases provide datasets that combine gene expression and cell viability data. CMAP and LINCS are the two most used in previous research.

     However, both CMAP and LINCS require each cell type to be profiled separately, and previous efforts have largely focused on studying responses in a small number of cell line contexts. But perturbation responses are often context-specific, reflecting their interaction with the cell’s underlying genomic or functional features. For example, targeted drugs may elicit responses only in cell lines harboring particular oncogenic mutations or expressing certain genes, making observed results specific to the particular cell line models chosen.

     In 2020, a new approach for profiling gene expression signatures across multiple cell lines in parallel was developed: MIX-Seq. MIX-Seq combines:

  1. the ability to pool hundreds of cancer cell lines and co-treat them with one or more perturbations;

  2. the power of scRNA-seq to simultaneously profile the cells’ responses and resolve the identity of each cell based on single-nucleotide polymorphism (SNP) profiles.

     The profile of each perturbation (specified by drug, time, dose, and control type) is created by simultaneously measuring (within the same samples) the responses of a pool of more than 100 different human cell types to a drug. This makes it feasible to profile the responses of many diverse cell lines to a given perturbation and thus to assess context-specific effects.

Following the instructions provided by the MoA prediction task in a Kaggle competition, I accessed DepMap, obtained the PRISM Repurposing 19Q3 Primary Screen dataset, and tried to understand how the datasets in this competition were generated. The primary PRISM Repurposing dataset contains the results of pooled-cell-line chemical-perturbation viability screens for 4,518 existing drugs screened against 578 cancer cell lines.

For each perturbation (specified by a sig_id, which is used to query the known MoA annotation from the Dr. Insight package in R), I randomly selected 100 of the 578 cancer cell lines (c0-c99) and extracted the most correlated differentially expressed genes (p-value threshold 0.05), g0-g771. This yields a unique dataset that combines gene expression and cell viability data. I then split it into training and testing feature sets.

Features g- signify gene expression data, and c- signify cell viability data. cp_type indicates samples treated with a compound (cp_vehicle) or with a control perturbation (ctrl_vehicle); control perturbations have no MoAs; cp_time and cp_dose indicate treatment duration (24, 48, 72 hours) and dose (high or low).

To get the corresponding known MoA annotation of each drug, I use Dr.Insight, an R package that provides pathway analysis, investigating the mechanisms of identified drugs at the pathway level. Since DepMap and Dr.Insight both have the NCI Pathway Interaction Database (PID) embedded, calling Dr.Insight

and rearranging its output into a drug-pathway adjacency table with values ‘1’, ‘0’, and ‘-1’ (where ‘1’ means a drug up-regulates a pathway, ‘-1’ means down-regulation, and ‘0’ means no significant regulatory relationship), I obtained the training and target datasets.
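The rearranging step can be sketched in Python with pandas. The long-format input below is purely illustrative (drug and pathway names, and the column schema, are hypothetical, not Dr.Insight's actual output format); the point is the pivot from per-call rows into a {-1, 0, 1} drug-pathway adjacency table.

```python
import pandas as pd

# Hypothetical long-format output: one row per significant drug-pathway call.
# Column names are illustrative, not Dr.Insight's actual schema.
calls = pd.DataFrame({
    "drug":      ["d1", "d1", "d2"],
    "pathway":   ["PI3K", "MAPK", "PI3K"],
    "direction": [1, -1, 1],   # 1 = up-regulated, -1 = down-regulated
})

# Pivot to a drug x pathway adjacency table; drug-pathway pairs with no
# significant call are filled with 0.
adj = (calls.pivot(index="drug", columns="pathway", values="direction")
            .fillna(0)
            .astype(int))
```

Each row of `adj` is then a drug's regulation pattern over all pathways, which is the target representation used below.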

Final goal:

Use the training dataset to develop an algorithm that automatically labels each case in the test set as one or more MoA classes.

This is a multi-label classification problem: each case may be assigned any combination of k out of the n labels.

Model prediction goal:

For every sig_id, the model predicts the probability that the sample had a positive response for each MoA target.

Assessment of the model:

For N sig_id rows and M MoA targets, the model makes N×M predictions. Models are assessed by the log loss:

$score = -\frac{1}{M}\sum_{m=1}^{M}\frac{1}{N}\sum_{i=1}^{N}\left[y_{i,m} \log(\hat y_{i,m}) + (1-y_{i,m})\log(1-\hat y_{i,m})\right]$
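This metric can be computed with a few lines of NumPy. Because the double average over i and m is just the mean over all N×M entries, a single `np.mean` suffices; predictions are clipped away from 0 and 1 so the logarithms stay finite (the clipping epsilon is my choice, not specified by the source).

```python
import numpy as np

def moa_log_loss(y_true, y_pred, eps=1e-15):
    """Mean column-wise binary cross-entropy over N samples and M targets.

    y_true: (N, M) array of 0/1 labels; y_pred: (N, M) predicted probabilities.
    """
    p = np.clip(y_pred, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```

For example, `moa_log_loss(np.array([[1, 0], [0, 1]]), np.array([[0.9, 0.1], [0.2, 0.8]]))` averages the four per-entry cross-entropy terms.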

Data set:

There are four files in total:

train_features.csv - Features for the training set. Features g- signify gene expression data, and c- signify cell viability data. cp_type indicates samples treated with a compound (cp_vehicle) or with a control perturbation (ctrl_vehicle); control perturbations have no MoAs; cp_time and cp_dose indicate treatment duration (24, 48, 72 hours) and dose (high or low).

train_drug.csv - This file contains an anonymous drug_id for the training set only.

train_targets_scored.csv - The binary MoA targets that are scored.

train_targets_nonscored.csv - Additional (optional) binary MoA responses for the training data. These are not predicted nor scored.

Baseline EDA:

  1. Basic data overview
  2. Categories visualization
  3. Gene and cell features distribution
  4. Checking features correlation
  5. Targets analysis on the training set
  6. Train & Targets correlations: find the most correlated features for every target column.

This indicates that there are high correlations between features. Now let's try to find pairs of features with high correlation.

In total, 35 columns have a correlation higher than 0.9 with at least one other column.
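The search for such pairs can be sketched with pandas: compute the absolute correlation matrix, keep only the upper triangle so each pair is reported once, and filter by the threshold. (The function name and threshold default are mine; the report does not show its implementation.)

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return (upper.stack()                      # -> Series indexed by (col_a, col_b)
                 .loc[lambda s: s > threshold]
                 .sort_values(ascending=False))
```

Applied to the g-/c- feature frame, the number of distinct columns appearing in the result gives the "35 columns" figure above.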

Now let's check the targets and do some basic analysis of the feature-target relationship.

The largest fraction of positive samples for any single target column is 3.5%, so we are dealing with highly imbalanced data.

We can see here that about 40% of samples have zeros in all target columns and more than 50% have only one active target column.
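These sparsity statistics can be read straight off the scored target matrix; a minimal pandas sketch (the function name is mine, and real target column names differ from the toy ones used in testing):

```python
import pandas as pd

def target_sparsity_summary(targets):
    """Summarise label sparsity of a binary (samples x MoA targets) frame."""
    per_target_rate = targets.mean()        # positive rate per MoA column
    labels_per_row = targets.sum(axis=1)    # number of active targets per sample
    return {
        "max_positive_rate": per_target_rate.max(),
        "frac_all_zero": (labels_per_row == 0).mean(),
        "frac_single_label": (labels_per_row == 1).mean(),
    }
```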

After looking at the feature and target data respectively, it's time to find the most correlated features (gene differential expression or cell viability variation) for every target column (every single label in the MoA annotation).

Let's see the highest absolute correlation of each target column with the columns of the train set. Each bar on the chart is the maximum correlation of that target column over all columns of the training set.

Now let's see which columns from the training set have the highest number of "high" correlations with target columns. Each row of the chart shows how many times that column had the best correlation value with a different target column.

Let's select some random target columns and see how they relate to pairs of highly correlated features.

We can extract several group names from the target column names. It looks like the last term in each column name defines a group. Let's extract them and visualize the groups with more than one column.
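The extraction can be done by splitting each target name on its last underscore; a pandas sketch (the three example names are illustrative, in the style of MoA target columns, not taken from the actual data):

```python
import pandas as pd

target_cols = pd.Series([
    "5-alpha_reductase_inhibitor",
    "11-beta-hsd1_inhibitor",
    "acetylcholine_receptor_agonist",
])

# The last underscore-separated token is taken as the group name.
groups = target_cols.str.rsplit("_", n=1).str[-1]
group_sizes = groups.value_counts()

# Keep only groups represented by more than one target column.
multi = group_sizes[group_sizes > 1]
```

On the real target set this surfaces large suffix groups such as "inhibitor", "agonist", and "antagonist".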

Conclusion of basic EDA:

Feature engineering:

Now I check the distributions of the g- and c- features in the train and test sets to see whether they are spiky or normally distributed.

train set before using RankGauss

test set before using RankGauss

From the plots I found that the distributions are spiky rather than normal. The train and test sets appear to have the same shape.

Since the gene expression data and cell viability data can be controlled by the experimenter, it seems safe to assume that these data are independent of each other.

Also, since the shape of the distribution is close to normal to begin with, forcing it into a Gaussian distribution should not cause much of a problem. Therefore I apply RankGauss here and check the shape again.

As intended, the distribution of each feature now resembles a normal distribution.
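The report does not show its RankGauss implementation; a common way to apply it is scikit-learn's `QuantileTransformer` with a normal output distribution, which rank-transforms each feature and maps the ranks through the inverse normal CDF. A minimal sketch on toy skewed data (the toy data and parameter values are my assumptions):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Spiky, skewed toy features standing in for the g-/c- columns.
rng = np.random.default_rng(0)
X = rng.exponential(size=(1000, 5))

# RankGauss: rank each feature, then map ranks through the inverse
# normal CDF so every column becomes approximately Gaussian.
qt = QuantileTransformer(n_quantiles=100,
                         output_distribution="normal",
                         random_state=0)
X_gauss = qt.fit_transform(X)
```

Note that the transformer should be fitted on the training features and the same fitted mapping applied to the test features, so both sets share one transformation.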

Dimensionality Reduction (PCA)

PCA is a linear transformation that projects the data into another space whose axes are ordered by the variance of the data they explain. PCA results can be evaluated with reconstruction error and cumulative percent variance.

Since the feature space is high-dimensional, we first choose the optimal number of principal components.

In this case, 600 principal components are needed to explain 95% of the variance.
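Choosing the component count from the cumulative explained-variance curve can be sketched as follows (the helper name is mine; note that scikit-learn also accepts a float, e.g. `PCA(n_components=0.95)`, to do this in one step):

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_for_variance(X, target=0.95):
    """Smallest number of principal components explaining `target` variance."""
    pca = PCA().fit(X)
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    # First index where the cumulative ratio reaches the target, 1-based.
    return int(np.searchsorted(cumvar, target) + 1)
```

Applied to the RankGauss-scaled g-/c- features, this is how the figure of 600 components for 95% variance is obtained.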

Next stage: improve CV score of the NN model built later

With the number of principal components decided, I considered how to make use of the PCA features and applied clustering to improve the CV score and LB score of the NN model used to make the multi-label prediction. I was inspired by a notebook posted on Kaggle: https://www.kaggle.com/namanj27/new-baseline-pytorch-moa?scriptVersionId=42917782. It provides the following ideas:

In the next report, I follow the idea of this notebook and integrate it with the scaling process described above. I applied RankGauss to the g- and c- columns for preprocessing, transforming the data to an approximately normal distribution, and examined its effect on the scores.

I also made the following changes based on this notebook after several trials (the strategies I found most useful) and trained three models respectively to compare the results:

Detailed code and results are in the second coding notebook.

REFERENCES

[1] https://clue.io/repurposing

[2] https://depmap.org/portal/download/

[3] https://lincsproject.org/

[4] https://www.kaggle.com/namanj27/new-baseline-pytorch-moa?scriptVersionId=42917782

[5] http://fastml.com/preparing-continuous-features-for-neural-networks-with-rankgauss/

[6] https://scrnaseq-course.cog.sanger.ac.uk/website/cleaning-the-expression-matrix.html

[7] Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet J-P, Subramanian A, Ross KN, et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006;313(5795):1929–35.

[8] Corsello, S.M., Nagari, R.T., Spangler, R.D. et al. Discovering the anticancer potential of non-oncology drugs by systematic viability profiling. Nat Cancer 1, 235–248 (2020). https://doi.org/10.1038/s43018-019-0018-6

[9] Subramanian, A., Narayan, R., Corsello, S. M., Peck, D. D., Natoli, T. E., Lu, X., Gould, J., Davis, J. F., Tubelli, A. A., Asiedu, J. K., Lahr, D. L., Hirschman, J. E., Liu, Z., Donahue, M., Julian, B., Khan, M., Wadden, D., Smith, I. C., Lam, D., Liberzon, A., … Golub, T. R. (2017). A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell, 171(6), 1437–1452.e17. https://doi.org/10.1016/j.cell.2017.10.049

[10] Townes, F.W., Hicks, S.C., Aryee, M.J. et al. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biol 20, 295 (2019). https://doi.org/10.1186/s13059-019-1861-6